Capacity Planning
Overview
This document provides general guidelines on how to plan initial server capacity, optimize system resources, and make provisions for system and environmental growth.
In general, capacity should be planned to handle the average peak load. Resolve recommends treating the average peak load as three times the average load of the system. This is necessary for maintaining reasonable system performance when the highest peak load occurs.
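For example (illustrative numbers only): if monitoring shows an average load of 500 ActionTasks per minute, capacity should be planned for an average peak load of 3 × 500 = 1,500 ActionTasks per minute.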
As Resolve Actions Pro resource utilization characteristics may change, the system should be monitored regularly. The purpose of this monitoring is to detect shifts in resource utilization so that acceptable performance levels are maintained. It also allows for more precise adjustment of configuration or resources to increase or decrease capacity. Moreover, the initial architecture may not remain optimal as the portfolio of Runbooks grows. Each Runbook's make-up utilizes different Actions Pro resources, so the architecture may need to be adjusted to avoid bottlenecks.
Factors That Affect Resource Utilization
The performance of a system depends on hardware and software configuration, network configuration, database design, and more. For Actions Pro, there are five key factors that affect resource utilization.
Burst Rate of ActionTasks
As incidents are received by Actions Pro, ActionTasks are executed to resolve them. An ActionTask contains one or more commands that are executed against a target system. The number of ActionTasks implemented depends on the Runbook design. A high burst rate could result in a load that is temporarily greater than the system capacity. As a result, a negative effect on response time, Runbook synchronization, and performance should be expected. We typically observe that peak load is three to four times the average load, and advise our customers to plan capacity based on a multiple of the average load, which is termed the average peak load.
Number of Concurrent Users
As the number of concurrent users increases, the system spends more resources handling user interactions. This places more load on the system, particularly RSView, in addition to the load initiated by users to resolve incidents. If you expect a high number of concurrent users, plan for extra capacity.
Total Events Processed
The total events processed over a period of time determines the number of records, such as worksheets or social feed entries, that are generated and saved to the internal database. The amount of data in the internal database affects the response time of queries and the speed of backup and restore. Over time, the records can build up and require additional disk space.
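As a rough, purely illustrative sizing example (both figures are assumptions, not measurements): a system processing 100,000 events per month, each generating about 50 KB of worksheet and social feed records, grows the internal database by roughly 100,000 × 50 KB ≈ 5 GB per month, which should be reflected in disk-space planning.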
Runbook Make Up
The Runbook make-up depends on the nature of the work, and organizations may set up Runbooks with a different focus. Some organizations may have a small number of highly complex Runbooks that each contain hundreds of ActionTasks; this places a heavy load on RSControl. Other organizations may have hundreds of simple Runbooks, each set up to test a particular service, which places a heavy load on RSRemote. Additional information on this topic is available in the Optimization, Monitoring, and Future Planning section of this document.
Load from Ancillary Tasks
Ancillary tasks, such as archiving or import/export, can also have an impact on performance if a task is not configured and scheduled properly. For example, archiving records during peak hours could slow down system performance and push memory utilization to the limit, leading to an unnecessary restart. It is recommended that ancillary tasks be run during off hours. The additional load from these tasks should be taken into account when planning system capacity.
Capacity Planning
Capacity is determined by the total CPUs, memory, and disk space available to run Actions Pro, and is bound by the most limiting factor. Listed below are some representative configurations and guidelines.
Standard Configuration
The Standard configuration, by default, runs on a single host and is designed to handle a smaller load (it usually has a lower number of CPUs, typically 4). It may lack some advanced features such as High Availability, and it caps the maximum system throughput because no additional resources can be added to scale up. While the system may be efficient by itself, its potential is limited.
Performance Test Examples
The test results below (Tables 1 and 2) are based on simulated users performing the following actions:
- Login and get User
- Login, execute and get Results Macro
- Login and view Custom Form
All concurrent users were split between the above actions, and the executed ActionTasks performed operations such as accessing a server, listing directories, etc.
In the following setup, all Actions Pro components are deployed on a single VM (hardware specs: 4 vCPU, 32GB RAM, 40GB storage).
Table 1: Application Performance – Standalone Configuration
| Standalone Actions Pro (1x VM) | Concurrent users | Delay between API calls | Total calls (Actions Pro 6.3) | Total calls (Actions Pro 6.4) |
|---|---|---|---|---|
| Login and get User | 15 | 5 sec | 15558 | 10429 |
| Login, execute and getResultsMacro | 30 | 5 sec | 3374 | 20692 |
| Login and view custom form | 30 | 5 sec | 15396 | 20863 |
In the cluster setup, Actions Pro is deployed on 3 VMs (hardware specs: 8 vCPU, 32GB RAM, 40GB storage).
Table 2: Application Performance – Cluster Configuration
| Actions Pro Cluster (3x VMs) | Concurrent users | Delay between API calls | Total calls (Actions Pro 6.3) | Total calls (Actions Pro 6.4) |
|---|---|---|---|---|
| Login and get User | 50 | 5 sec | 32104 | |
| Login, execute and getResultsMacro | 50 | 6 sec | 11711 | 28708 |
| Login and view custom form | 50 | 5 sec | 34313 | 34645 |
Table 3: Total number of Runbooks executed during tests
Deployment | Actions Pro 6.3 | Actions Pro 6.4 |
---|---|---|
1x VM | 3374 | 20692 |
3x VMs (Cluster) | 11711 | 28708 |
Table 4: JVM memory allocation per component
Component | Xms (MB) | Xmx (MB) |
---|---|---|
RSSEARCH | 8192 | 8192 |
RSVIEW | 2048 | 2048 |
RSCONTROL | 2048 | 2048 |
RSREMOTE | 2048 | 2048 |
RSMGMT | 512 | 512 |
- Note: for high throughput, it is recommended that no other components be set up on the RSMQ server.
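These allocations map to the standard JVM heap options. As an illustration (values in MB as listed in the table; the file in which each component's JVM options are set is deployment-specific), the RSVIEW row corresponds to:

```
# Illustrative JVM options for the RSVIEW row above: a fixed 2 GB heap.
# Where these flags are configured varies by deployment.
-Xms2048m -Xmx2048m
```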
Summary
The executed tests are specific to the system on which they were performed; different Runbooks will produce different results.
The important conclusions from the above tests are:
- RSControl and RSRemote nodes scale and distribute load evenly, with each node receiving the same number of tasks.
- The bottleneck in the example case is the network traffic to the DB, which hits the limit of the machine (6-7 Gbit/s).
- With a 5-node cluster, the load on some of the machines was ~30%; resource utilization was poor because of the DB throttling. With some tuning, similar results could likely be achieved with fewer nodes in the cluster.
- Elasticsearch (ES) uses a lot of CPU; in the example case it averaged 5-6 CPU cores at 100% during the tests.
Figure 1 - 6.3 vs 6.4 performance tests
REST Performance Results
Figure 2 - REST response times 6.3 vs. 6.4 for 500 users
Figure 3 - REST response times 6.3 vs. 6.4 for 300 users
Figure 4 - SOAP response times 6.3 vs 6.4 for 1 user
Figure 5 - SOAP response times 6.3 vs 6.4 for 10 users
Figure 6 - SOAP response times 6.3 vs 6.4 for 20 users
Figure 7 - SOAP response times 6.3 vs 6.4 for 40 users
Figure 8 - SOAP response times 6.3 vs 6.4 for 300 users
Figure 9 - SOAP response times 6.3 vs 6.4 for 500 users
Performance Comparison With Previous Actions Pro Versions
A specific Runbook (40_Noop) was used to load only RSControl, eliminating any network I/O and external third-party dependencies. The following test simulates a high peak burst load to Actions Pro running on a single VM. For high-demand workloads, redundancy, or production environments, we do NOT recommend running all Actions Pro components on a single VM.
Below is a rough comparison between Actions Pro 6.3, 6.4, and 7.0, performed on a single VM (16 cores, 32GB RAM) executing the 40_Noop Runbook.
Table 5: Comparison results
| Metric | Actions Pro 6.3 | Actions Pro 6.4 | Actions Pro 7.0 |
|---|---|---|---|
| RB/s | 7.84 | 36.92 | 38.09 |
| AT/s | 330 | 1550 | 1524 |
Table 6: Hardware requirements
| OS | CPU (cores) | Memory (GB) | Storage (GB) | Provisioned IOPS | Network interface |
|---|---|---|---|---|---|
| CentOS 8 | 16 | 32 | 500 | 1000 | 1 Gbit |
Table 7: JVM memory allocation per component
Component | Xms (MB) | Xmx (MB) |
---|---|---|
RSSEARCH | 16384 | 16384 |
RSMQ4 | ~23552 | ~23552 |
RSVIEW | 2048 | 4096 |
RSCONTROL | 2048 | 4096 |
RSREMOTE | 2048 | 4096 |
RSMGMT | 512 | 1024 |
RSSYNC | 1024 | 1024 |
RSARCHIVE | 2048 | 4096 |
LOGSTASH | 1024 | 1024 |
The cluster layout consists of primary and secondary instances of RSMQ and RSView. The Actions Pro components that scale horizontally in the following tests are RSControl, RSRemote, and the RSSearch data nodes.
Scaling improvements from previous Actions Pro versions
Table 8: Scaling to 5, 10, and 20 RSControl instances, compared with Actions Pro v6.3

| # of RSControls | Actions Pro 6.3 RB/s | Actions Pro 6.3 AT/s | Actions Pro 6.4 RB/s | Actions Pro 6.4 AT/s | Actions Pro 7.0 RB/s | Actions Pro 7.0 AT/s |
|---|---|---|---|---|---|---|
| 5 | 16 | 3520 | 81 | 17820 | 63 | 14092 |
| 10 | 23 | 5060 | 172 | 38184 | 141 | 31302 |
| 20 | - | - | 329 | 73038 | - | - |
No data is available for Actions Pro v6.3 on 20 nodes, as throughput actually degraded beyond 10 nodes.
In a real-world scenario, results WILL vary in either direction. A good example is a customer's set of Runbooks, provided and measured in a similar manner.

Table 9: Customer Runbook scaling results

| # of RSControls | Actions Pro 6.3 RB/s | Actions Pro 6.3 AT/s | Actions Pro 6.4 RB/s | Actions Pro 6.4 AT/s |
|---|---|---|---|---|
| 3 | 3.38 | 750 | - | - |
| 5 | 4.67 | 1036 | 34.17 | 7586 |
| 10 | 5 | 1110 | 69.78 | 15491 |
The above tests take simulated network I/O wait time into account, but latency to third-party systems will influence the end results. The same should be taken into account when configuring Actions Pro for optimal performance.
High-Performance Cluster
Important changes were introduced in version 6.4 of Actions Pro to allow better scalability for high-throughput demands. The following recommendations are only a guideline; the exact requirements may vary for each use case and should be adjusted accordingly. A high-performing Actions Pro cluster deployment should include dedicated nodes for each component whenever possible. This especially applies to RSMQ and RSSEARCH, as they scale differently.
RSMQ is CPU intensive, but queues are single-threaded and can run on low-throughput disks with a small pool of local storage. The number of CPU cores should be sized according to the number of RSControl/RSRemote nodes in the cluster; a high-performing deployment may require 16 CPU cores at 2.0 GHz or greater speed.

RSMQ CPU cores = (total number of RSControl nodes in the cluster) + 2

For example, a cluster with 10 RSControl nodes calls for 12 RSMQ CPU cores.
If possible, RSMQ servers should be dedicated to RSMQ, with no other components present. For general usage, RSMQ is configured to flush messages to disk when it reaches 40% of the available physical memory on the server. The operation is I/O intensive and may severely impact further execution.
Under heavy peak load, on slower disks, or on underprovisioned servers, this may in extreme cases force a failover to the standby RSMQ instance. The threshold (vm_memory_high_watermark) can safely be tuned to 70% if no other components coexist on the server, leaving 30% of the available physical memory to the OS for caching.
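A minimal sketch of that tuning, assuming RSMQ exposes the standard RabbitMQ configuration format (the file name and location vary by installation):

```
# rabbitmq.conf (assumed format; adjust to your RSMQ installation)
# Allow RSMQ to use up to 70% of physical memory before flushing to disk;
# only safe when no other components share the server.
vm_memory_high_watermark.relative = 0.7
```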
The RSMQ management interface by default shows message rates for each queue, channel, and exchange. To get the best possible performance out of a CPU-bound server, these statistics can be disabled, for example:
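A minimal sketch, assuming RSMQ uses the standard RabbitMQ management plugin settings:

```
# rabbitmq.conf (assumed; adjust to your RSMQ installation)
# Disable per-queue/channel/exchange message-rate statistics to reduce
# CPU overhead on a CPU-bound server.
management.rates_mode = none
```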
RSSearch
For high load demand, RSSEARCH nodes can scale horizontally to 5 or more servers, depending on the use case. A holistic approach is required to determine optimal settings. RSSEARCH nodes can be added to the cluster at runtime when needed. If heavy I/O is observed, we recommend tuning RSSEARCH to fsync its transaction log on a set interval (for example, 5 s), as sketched below.
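A sketch of that tuning, assuming RSSEARCH indices accept the standard Elasticsearch settings API (my-index is a placeholder index name):

```
# Elasticsearch index settings (assumed to apply to RSSEARCH indices):
# fsync the translog asynchronously every 5 s instead of on every request.
PUT /my-index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}
```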
RSControl
Tuning RSCONTROL is crucial to ensure optimal resource utilization across the cluster. The primary mechanism for controlling the flow of work is the MAXEXECUTIONEVENTS property; values may vary from 64 to 2048 depending on the type of Runbooks, dependencies on third parties, etc. Load on Elasticsearch can be further reduced by enabling a bulk index processor, controlled by the following properties in blueprint.properties (a sample fragment follows the list):
- rssearch.concurrentrequests - enables/disables concurrent requests in the Elasticsearch client used for batching requests to ES.
- rssearch.bulkactions - the number of entities kept in the Elasticsearch client before it flushes them to ES.
- rssearch.bulksize - the size in MB that will be kept in the Elasticsearch client before it flushes its content to ES.
- rssearch.flushinterval - the interval the Elasticsearch bulk client will wait before it flushes its content to ES.
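For illustration only, a blueprint.properties fragment using the properties above; the values and units shown are placeholder assumptions to be tuned per deployment, not recommendations:

```
# blueprint.properties - bulk indexing to Elasticsearch
# (illustrative placeholder values and assumed units; tune per deployment)

# Enable concurrent bulk requests to ES (assumed: 0 disables)
rssearch.concurrentrequests=1
# Flush after this many entities are queued in the client
rssearch.bulkactions=1000
# Flush after this many MB are buffered in the client
rssearch.bulksize=5
# Flush at least this often (assumed seconds)
rssearch.flushinterval=5
```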
Sample Actions Pro Cluster Layout For More Than 1 000 000 AT/m
The following is a sample layout of an Actions Pro cluster that achieves a throughput of 1 000 000 AT/m. The setup below uses a total of 15 VMs, with the Actions Pro components grouped so that RSSEARCH, RSCONTROL, and RSREMOTE can be scaled by adding instances of the same type. Use it for reference only - different automations or integrations with third parties may require different resources.
Table 10: Hardware requirements
Group | Instance type | Actions Pro component | Instance count | CPU cores3 | Memory (GB) | Storage (GB) | Network interface |
---|---|---|---|---|---|---|---|
Group A | Dedicated1 | RSMQ | 2 | 16 | 32 | 500 | 1 Gbit |
Group B | Dedicated1 | RSSEARCH, RSMGMT | 3 | 16 | 32 | 500 | 1 Gbit |
Group C | Shared2 | RSCONTROL, RSREMOTE, RSMGMT | 7 | 8 | 8 | 500 | 1 Gbit |
Group D | Shared2 | RSCONTROL, RSREMOTE, RSMGMT, RSVIEW | 2 | 8 | 16 | 500 | 1 Gbit |
Group E | Shared2 | RSCONTROL, RSREMOTE, RSMGMT, RSSYNC, RSARCHIVE, LOGSTASH | 1 | 8 | 24 | 500 | 1 Gbit |
Table 12: JVM memory allocation per component
Component | Xms (MB) | Xmx (MB) |
---|---|---|
RSSEARCH | 16384 | 16384 |
RSMQ4 | ~23552 | ~23552 |
RSVIEW | 2048 | 4096 |
RSCONTROL | 2048 | 4096 |
RSREMOTE | 2048 | 4096 |
RSMGMT | 512 | 1024 |
RSSYNC | 1024 | 1024 |
RSARCHIVE | 2048 | 4096 |
LOGSTASH | 1024 | 1024 |
The setup above has been confirmed to support up to 20 RSCONTROL instances with 5 RSSEARCH nodes before it starts saturating the 1Gbit link.
Sample Actions Pro Cluster Layout For 500 000 AT/m
The following is a sample layout of an Actions Pro cluster that achieves a throughput of 500 000 AT/m. The setup below uses a total of 10 VMs, with the Actions Pro components grouped so that RSSEARCH, RSCONTROL, and RSREMOTE can be scaled by adding instances of the same type. Use it for reference only - different automations or integrations with third parties may require different resources.
Table 13: Hardware requirements
Group | Instance type | Actions Pro component | Instance count | CPU cores3 | Memory (GB) | Storage (GB) | Network interface |
---|---|---|---|---|---|---|---|
Group A | Dedicated1 | RSMQ | 2 | 16 | 32 | 500 | 1 Gbit |
Group B | Dedicated1 | RSSEARCH, RSMGMT | 3 | 16 | 32 | 500 | 1 Gbit |
Group C | Shared2 | RSCONTROL, RSREMOTE, RSMGMT | 2 | 8 | 8 | 500 | 1 Gbit |
Group D | Shared2 | RSCONTROL, RSREMOTE, RSMGMT, RSVIEW | 2 | 8 | 16 | 500 | 1 Gbit |
Group E | Shared2 | RSCONTROL, RSREMOTE, RSMGMT, RSSYNC, RSARCHIVE, LOGSTASH | 1 | 8 | 24 | 500 | 1 Gbit |
Table 15: JVM memory allocation per component
Component | Xms (MB) | Xmx (MB) |
---|---|---|
RSSEARCH | 16384 | 16384 |
RSMQ4 | ~23552 | ~23552 |
RSVIEW | 2048 | 4096 |
RSCONTROL | 2048 | 4096 |
RSREMOTE | 2048 | 4096 |
RSMGMT | 512 | 1024 |
RSSYNC | 1024 | 1024 |
RSARCHIVE | 2048 | 4096 |
LOGSTASH | 1024 | 1024 |
High Availability Configuration
The typical High Availability configuration has three times as many CPUs (typically 12). The built-in High Availability module can distribute the load evenly among several hosts, which leads to a much larger system capacity. It generally runs on a cluster of three hosts and provides additional benefits such as higher tolerance to failure and more flexibility in configuration.
The scalability guidelines, RSMQ/RSSEARCH/RSCONTROL tuning recommendations, sample cluster layouts, and JVM memory allocations for High Availability deployments are the same as those described in the High-Performance Cluster section above.
Optimization, Monitoring, and Future Planning
Optimization
Depending on the Runbook design and the number of concurrent users, different components of Actions Pro will carry different degrees of load, which means that targeted optimization may be required.
Monitoring
Actions Pro performance should be continuously monitored and periodically re-evaluated to detect any issue with the current setup. Monitoring is also necessary for extrapolating performance trends to assist with future capacity planning. It is recommended that customers perform benchmark testing:
- When Actions Pro is installed, updated, or upgraded
- When the initial set of Runbooks is deployed
- After a major feature, such as archiving, is turned on
- On average, every three months
The following metrics are important for gauging performance and planning for future capacity. Actions Pro provides this information under "Admin Reports".
- Server Load Average over 1, 5, and 15 minutes - the Server Load Average is one of the key indicators of Actions Pro capacity. The system is overloaded if the load average is higher than the number of CPUs reported by the operating system (e.g., cat /proc/cpuinfo). Generally, Actions Pro should run at 30-40% load to allow for burst execution loads from event storms, etc. (see the example after this list).
- The Transaction Count indicates the volume of executions processed. It can be used to trend usage loads for capacity planning, as well as to identify burst-ratio (peak/average) load profiles. The burst ratio is useful for determining the overhead capacity required to manage burst traffic loads.
- The Runbook (startup) Latency indicates the average time taken to start the execution of automations. If this value is consistently increasing, it could indicate that the server may not have sufficient resources. Correlated with Load Average and Transaction Count, it is a good indicator of near-term capacity.
- Check whether the allocated JVM memory is close to the limit. Be sure to allow headroom for bursts in execution load, which may otherwise cause a component to run Out of Memory (OOM). This is one of the key tuning parameters to configure for new deployments and for major changes in execution load profiles (a spot-check example follows this list).
- Thread Count - the JVM thread count is typically bounded by the maximum thread pool size. The system's maximum thread pool size does not need to be adjusted unless there is heavy load and the system needs to be optimized for better efficiency. Increasing both values will typically provide more concurrency and performance; however, beyond a certain optimal threshold, performance will degrade.
- Active Users / Average Response Time - If there is a large number of Actions Pro users or the number of concurrent users is growing, it may be necessary to monitor the number of Active Users and ensure that Average Response Time does not increase beyond acceptable thresholds.
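The load-average and JVM-memory checks above can be performed with standard tools. A minimal sketch, assuming a Linux host with JDK tools available (&lt;pid&gt; is a placeholder for the component's Java process ID):

```
# Compare load averages against the CPU count (Linux)
nproc               # number of CPUs
cat /proc/loadavg   # 1-, 5-, and 15-minute load averages

# Spot-check JVM heap utilization for a component process every 5 seconds
jstat -gcutil <pid> 5000
```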
Figure 10: Sample Admin Reports
Future Planning
To plan for future capacity, the following factors should be taken into consideration:
- All the factors discussed above that could impact performance.
  - Will any of the factors change?
  - Is every assumption still valid?
- Performance of the existing Actions Pro system.
  - The latest performance benchmark shows actual performance compared with what was assumed in the initial planning.
  - Are the initial expectations similar to the actual benchmark?
- Changes to the existing Actions Pro configuration. Take into account any ancillary tasks that will be activated, such as archiving due to a larger data set.
- Changes to the environment in which Actions Pro operates.
  - Is there any new system that Actions Pro has to work with?
  - Has any existing system grown in size?
Summary
Capacity planning is best executed when a holistic approach is applied. Historical performance benchmark records provide the best guidance on how to scale up. To minimize the risk of under-provisioning, it is recommended to introduce changes in incremental steps and to adjust capacity based on the updated performance metrics.
Footnotes:
1. Dedicated: a single Actions Pro component should be deployed per server.
2. Shared: multiple Actions Pro components can co-exist on the same server.
3. CPU cores at 2.0 GHz or greater speed.
4. RSMQ: it is recommended that no other components be set up on the RSMQ server for high throughput. If possible, adjust vm_memory_high_watermark to 70%, i.e. ~23 GB of the 32 GB available memory.